
This post discusses the mathematics behind eigenvectors and eigenvalues, the spectral decomposition theorem, and principal component analysis (PCA). Moreover, I will show an application of this technique.

EIGENVECTORS AND EIGENVALUES

Definition 1. A non-null vector \vec{v} \in V (vector space), i.e. \vec{v} \neq 0, is said to be an eigenvector of T if there exists a real number \lambda such that T(\vec{v}) = \lambda \vec{v}. The scalar \lambda is called an eigenvalue of T associated with \vec{v}. It follows that \vec{v} and T(\vec{v}) have the same support line (i.e. the linear transformation T maps the vector \vec{v} into the same direction):

\begin{aligned}
  T: v \rightarrow v \\[5px]
  \Rightarrow
  T(\vec{v}) = \lambda \vec{v} \\[5px]
  \Rightarrow A\vec{v} = \lambda\vec{v}
  \end{aligned}

Proposition 1. The eigenvalues of A are precisely the roots of the characteristic polynomial of A.

Let A \in \mathcal{M}_{p}(\mathbb{R}) (equivalently, A \in \mathbb{R}^{p \times p}) be the matrix of the transformation T: \mathbb{R}^{p} \rightarrow \mathbb{R}^{p}. Therefore

\lambda \in \mathbb{R} is an eigenvalue of A

\Leftrightarrow \exists \vec{v} \neq 0 such that A\vec{v} = \lambda\vec{v}

\Leftrightarrow A\vec{v} - \lambda I\vec{v} = 0 where I\vec{v} = \vec{v}

\Leftrightarrow (A - \lambda I)\vec{v} = 0

in which I is the identity matrix. This system has a non-trivial solution if, and only if, the determinant of the matrix (A - \lambda I) is zero (a consequence of Cramer's Rule),

\begin{cases}
      \vec{v} \neq 0\\
      \det(A - \lambda I) = 0
    \end{cases}

In this manner, (A - \lambda I) is singular (non-invertible). If (A - \lambda I) were invertible, the only solution would be the trivial one, \vec{v} = (A - \lambda I)^{-1}\vec{0} = \vec{0}, which is not what we want!

Example: Let A be the matrix below; we will find its eigenvectors and eigenvalues.

\begin{bmatrix}
 2 & -1 \\
-1 &  2 
\end{bmatrix}

and knowing that det(A - \lambda I) =0. Then,

\det\begin{bmatrix}
 2 - \lambda & -1 \\
-1 &  2 - \lambda 
\end{bmatrix} = (2 - \lambda)(2 - \lambda) - 1 = 0

\lambda^2 - 4 \lambda + 3 = 0

(\lambda - 3)(\lambda - 1) = 0

The solutions are the eigenvalues \lambda_{1} = 3 and \lambda_{2} = 1. For the eigenvector associated with \lambda_{1} = 3, we have

\begin{bmatrix}
 2 - 3 & -1 \\
-1 &  2 - 3 
\end{bmatrix} \begin{bmatrix}
 v_{1}  \\
 v_{2}
\end{bmatrix} = \vec{0}

\begin{bmatrix}
 -1 & -1 \\
-1 &  -1 
\end{bmatrix} \begin{bmatrix}
 v_{1}  \\
 v_{2}
\end{bmatrix} = \vec{0}

\begin{cases}
      - v_{1} - v_{2} = 0\\
      - v_{1} - v_{2} = 0
    \end{cases} \Rightarrow  v_{1} = - v_{2}

(- v_{2}, v_{2})^{\top} = v_{2}(- 1, 1)^{\top} \Rightarrow \lambda_{1} = 3 \sim \text{span}\bigg\{\begin{bmatrix}
 -1  \\
 1
\end{bmatrix}\bigg\} \Rightarrow \vec{v}_{1} = (-1, 1)^{\top}

For the eigenvector associated with \lambda_{2} = 1, we have

\begin{bmatrix}
 2 - 1 & -1 \\
-1 &  2 - 1 
\end{bmatrix} \begin{bmatrix}
 v_{1}  \\
 v_{2}
\end{bmatrix} = \vec{0}

\begin{bmatrix}
 1 & -1 \\
-1 &  1 
\end{bmatrix} \begin{bmatrix}
 v_{1}  \\
 v_{2}
\end{bmatrix} = \vec{0}

\begin{cases}
        v_{1} - v_{2} = 0\\
      - v_{1} + v_{2} = 0
    \end{cases} \Rightarrow  v_{1} = v_{2}

( v_{2}, v_{2})^{\top} = v_{2}(1, 1)^{\top} \Rightarrow \lambda_{2} = 1 \sim \text{span}\bigg\{\begin{bmatrix}
 1  \\
 1
\end{bmatrix}\bigg\} \Rightarrow \vec{v}_{2} = (1, 1)^{\top}

and normalizing the vectors to length 1 (unit vectors), they are

\frac{\vec{v}_{1}}{||\vec{v}_{1}||} = \begin{bmatrix}
 -1/\sqrt{2} \\
 1/\sqrt{2}
\end{bmatrix}

\frac{\vec{v}_{2}}{||\vec{v}_{2}||} = \begin{bmatrix}
 1/\sqrt{2} \\
 1/\sqrt{2}
\end{bmatrix}
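As a quick numerical check of the worked example above, here is a minimal sketch using NumPy (the post itself does not include code; this is only illustrative):

```python
import numpy as np

# The matrix A from the example above.
A = np.array([[2.0, -1.0],
              [-1.0, 2.0]])

# np.linalg.eig returns the eigenvalues and unit-norm eigenvectors (as columns).
eigvals, eigvecs = np.linalg.eig(A)
print(np.sort(eigvals))                  # eigenvalues 1 and 3

# Each returned eigenvector already has unit length.
print(np.linalg.norm(eigvecs, axis=0))

# A v = lambda v holds for each eigenpair.
for lam, v in zip(eigvals, eigvecs.T):
    assert np.allclose(A @ v, lam * v)
```

Note that eigenvectors are only defined up to sign and scale, so NumPy may return (1/\sqrt{2}, -1/\sqrt{2})^{\top} instead of (-1/\sqrt{2}, 1/\sqrt{2})^{\top}; both span the same eigenspace.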

SPECTRAL DECOMPOSITION THEOREM

Reminder: I will discuss only the spectral decomposition theorem, but other methods could be used, such as the singular value decomposition (SVD).

A symmetric matrix is a matrix that is equal to its transpose. Thus,

A \in \mathcal{M}_{p}(\mathbb{R}) \Rightarrow A^{\top} \in \mathcal{M}_{p}(\mathbb{R}) \Rightarrow A = A^{\top}

Spectral Theorem (symmetric matrix): Let A \in \mathbb{R}^{p \times p} be symmetric; then

  1. The eigenvalues \lambda_{1}, \dots, \lambda_{p} are real;
  2. The eigenvectors of A associated with different eigenvalues are orthogonal to each other. Then:

    \begin{aligned}
<\vec{v}_{i}, \vec{v}_{j}> &= \vec{v}_{i}^{\top} \vec{v}_{j} \\[10pt]
 &=  \sum_{t=1}^{p} \vec{v}_{ti}\vec{v}_{tj}  \\[10pt]
 &= \vec{v}_{1i}\vec{v}_{1j} + \dots+ \vec{v}_{pi}\vec{v}_{pj} \\[10pt]
 &= 0
\end{aligned}

    with \vec{v}_{j} = (v_{1j}, \dots, v_{pj})^{\top} \in \mathbb{R}^{p \times 1}, i \neq j and i, j = 1, \dots, p;

    Proposition 2. An orthogonal set of non-zero vectors is linearly independent.

    Reminder:

    Linearly independent \nRightarrow orthogonal

    Orthogonal \Rightarrow linearly independent

  3. It is always possible to obtain a basis \{e_{1}, \dots, e_{p}\} of orthonormal eigenvectors. Then:

    \begin{matrix}
 \cfrac{\vec{v}_{1}}{||\vec{v}_{1}||} = (e_{11}, e_{21}, \dots, e_{p1})^{\top} = e_{1} \\
  \cfrac{\vec{v}_{2}}{||\vec{v}_{2}||} = (e_{12}, e_{22}, \dots, e_{p2})^{\top} = e_{2} \\
  \vdots\\
 \cfrac{\vec{v}_{p}}{||\vec{v}_{p}||} = (e_{1p}, e_{2p}, \dots, e_{pp})^{\top} = e_{p}
\end{matrix}

    e_{i}^{\top} e_{j} = \delta_{ij} =   \begin{cases}
      0 & \text{if } i \neq j\\
      1 & \text{if } i = j
    \end{cases}

    and i, j = 1, \dots, p;

  4. The relation P^{-1}AP = D holds, where P is defined as follows

    P_{p \times p} =\begin{bmatrix}
 e_{.1} & \dots & e_{.p} \\
\end{bmatrix} = 
\begin{bmatrix}
 e_{11} &  e_{12} & \dots & e_{1p} \\
 e_{21} &  e_{22} & \dots & e_{2p} \\
 \vdots &  \vdots & \ddots & \vdots \\
 e_{p1} &  e_{p2} & \dots & e_{pp} \\
\end{bmatrix}

    where P is the matrix whose columns are the orthonormal eigenvectors.

    Reminder: an orthogonal matrix P satisfies P^{\top}P = I = PP^{\top} and P^{-1} = P^{\top}.

    and D matrix is defined by

    D_{p \times p} =
\begin{bmatrix}
\lambda_{1} &         &       & \\
        &  \lambda_{2} &       & \\
        &         &\ddots & \\
        &         &       & \lambda_{p}\\
\end{bmatrix}

    where D is a diagonal matrix in which eigenvalues are diagonal elements;
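The decomposition can be verified numerically on the symmetric 2x2 matrix from the earlier example; the sketch below uses NumPy's `eigh` routine, which is specialized for symmetric matrices:

```python
import numpy as np

# Symmetric matrix from the eigenvalue example.
A = np.array([[2.0, -1.0],
              [-1.0, 2.0]])

# eigh returns real eigenvalues and orthonormal eigenvectors (columns of P).
eigvals, P = np.linalg.eigh(A)
D = np.diag(eigvals)

# P is orthogonal: P^T P = I, hence P^{-1} = P^T.
assert np.allclose(P.T @ P, np.eye(2))

# Diagonalization: P^{-1} A P = D, equivalently A = P D P^T.
assert np.allclose(P.T @ A @ P, D)
assert np.allclose(P @ D @ P.T, A)
```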

Definition 2. A quadratic form in \mathbb{R}^{p} is a function Q: \mathbb{R}^{p} \rightarrow \mathbb{R} of the form Q(y) = y^{\top}Ay.

  1. A is positive definite \Leftrightarrow Q(y) > 0, \forall y \in \mathbb{R}_{\small\setminus0}^{p}.
  2. A is positive semidefinite \Leftrightarrow Q(y) \geq 0, \forall y \in \mathbb{R}_{\small\setminus0}^{p}.

Hence

  1. A is positive definite, so \lambda_{j} > 0 for all j = 1, \dots, p.
  2. A is positive semidefinite, so \lambda_{j} \geq 0 for all j = 1, \dots, p.

Important: Let A \in \mathbb{R}^{p \times p} be positive definite. Then

  1. A is non-singular (\text{det}(A) > 0). Example:

    A = \begin{bmatrix}
            a & b \\
            b & d \\
          \end{bmatrix} \Rightarrow \text{det}(A) = ad - b^{2}

    \Rightarrow A^{-1} = \cfrac{1}{\text{det}(A)}\begin{bmatrix}
            d & - b \\
            - b & a \\
          \end{bmatrix}

  2. A^{-1} is positive definite.

If A \in \mathbb{R}^{p \times p} is positive semidefinite, then

  1. All elements on the main diagonal are non-negative and \mathrm{tr}(A) \geq 0.
  2. All principal minors of A are non-negative.
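The properties above can be checked numerically. The sketch below (illustrative only) verifies them on the 2x2 positive definite matrix used earlier, including the 2x2 inverse formula:

```python
import numpy as np

# A 2x2 symmetric positive definite matrix (same as the earlier example).
A = np.array([[2.0, -1.0],
              [-1.0, 2.0]])

# Positive definite: all eigenvalues strictly positive.
assert np.all(np.linalg.eigvalsh(A) > 0)
# Hence non-singular: det(A) > 0.
assert np.linalg.det(A) > 0

# 2x2 inverse formula: A^{-1} = (1/det(A)) [[d, -b], [-b, a]].
a, b, d = A[0, 0], A[0, 1], A[1, 1]
A_inv = np.array([[d, -b], [-b, a]]) / np.linalg.det(A)
assert np.allclose(A_inv, np.linalg.inv(A))

# A^{-1} is also positive definite.
assert np.all(np.linalg.eigvalsh(A_inv) > 0)
```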

COVARIANCE MATRIX

The covariance matrix is

\Sigma = \mathrm{E}((X - \mathrm{E}(X))(X - \mathrm{E}(X))^{\top})

where X is an n \times p matrix of random variables from a sample; x_{l} = (x_{l1}, \dots, x_{lp})^{\top} is the l-th row of X written as a column vector, with l = 1, \dots, n; and the sample mean vector is

\overline{x} = \sum_{l = 1}^{n}\cfrac{x_{l}}{n} = (\overline{x}_{1}, \dots, \overline{x}_{p})^{\top}

Therefore, the sample covariance matrix is

S = \cfrac{1}{n - 1} \sum_{l = 1}^{n} (x_{l} - \overline{x})(x_{l} - \overline{x})^{\top}

Now let \underset{ p \times 1}{e_{j}} = e_{.j} = (e_{1j}, \dots, e_{pj})^{\top} \in \mathbb{R}_{\small\setminus0}^{p} be any column vector. Thus

e_{j}^{\top}Se_{j} = e_{j}^{\top}\bigg( \cfrac{1}{n - 1} \sum_{l = 1}^{n} (x_{l} - \overline{x})(x_{l} - \overline{x})^{\top} \bigg)e_{j}

Since e_{j} does not depend on l, we can move it inside the summation:

e_{j}^{\top}Se_{j} = \bigg( \cfrac{1}{n - 1} \sum_{l = 1}^{n} \underset{1 \times p}{e_{j}^{\top}}\underset{p \times 1}{(x_{l} - \overline{x})}\underset{1 \times p}{(x_{l} - \overline{x})^{\top}}\underset{p \times 1}{e_{j}} \bigg)

e_{j}^{\top}Se_{j} = \cfrac{1}{n - 1} \sum_{l = 1}^{n} \Big( (x_{l} - \overline{x})^{\top}e_{j} \Big)^{2} \geq 0

Thereby, the covariance matrix is always positive semidefinite. Notice that e_{j}^{\top}Se_{j} is zero only if (x_{l} - \overline{x})^{\top}e_{j} = 0 for all l = 1, \dots, n.

Another point to keep in mind: if \text{det}(S) = 0 (which happens when e_{j}^{\top}Se_{j} = 0 for some e_{j} \neq 0), then S is singular, i.e., it has no inverse. In other words, S is ill-conditioned, and in my opinion you should start by checking for a collinearity problem. Alternatively, the Moore-Penrose generalized inverse can be used to obtain a pseudoinverse matrix.
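The sample covariance formula and its positive semidefiniteness can be checked directly; the sketch below uses synthetic data for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))          # n = 50 observations, p = 3 variables

# Sample covariance from the definition: S = (1/(n-1)) sum (x_l - xbar)(x_l - xbar)^T.
xbar = X.mean(axis=0)
Xc = X - xbar
S = Xc.T @ Xc / (X.shape[0] - 1)

# It matches NumPy's built-in estimator.
assert np.allclose(S, np.cov(X, rowvar=False))

# Positive semidefinite: every eigenvalue of S is non-negative
# (a small tolerance absorbs floating-point round-off).
assert np.all(np.linalg.eigvalsh(S) >= -1e-12)
```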

PRINCIPAL COMPONENT ANALYSIS

PCA is a technique that finds the directions in which the projection of the data has the largest variance. The first such direction is, by definition, called the first principal direction. To find it, consider the following optimization problem over e_{1} \in \mathbb{R}^{p}:

\begin{aligned}
& \underset{e_{1} \neq 0}{\text{max}}
& & \mathrm{Var}(X e_{1}) = e_{1}^{\top}Se_{1} \\
& \text{subject to} & & ||e_{1}||^{2} = 1,  \\
\end{aligned}

Using the formalization of the Lagrange multiplier method to e_{1}, we have

L = e_{1}^{\top}Se_{1} - \lambda_{1}(e_{1}^{\top}e_{1} - 1)

\begin{aligned}
\frac{\partial L}{\partial e_{1}} &= 2Se_{1} - 2\lambda_{1}e_{1} = 0 \Leftrightarrow Se_{1} = \lambda_{1}e_{1} \Rightarrow
e_{1}^{\top}Se_{1} = \lambda_{1} \\
 \\
\frac{\partial L}{\partial \lambda_{1}} &= e_{1}^{\top}e_{1} - 1 = 0  \Leftrightarrow e_{1}^{\top}e_{1} = 1\\
\end{aligned}

It is not hard to see that \lambda_{1} is the maximum variance, attained by \underset{ n \times 1}{Y_{1}} = \underset{ n \times p}{(X - \overline{x})} \underset{ p \times 1}{e_{1}}.

Now we need to find the second direction, called the second principal direction. Consider the following optimization problem over e_{2} \in \mathbb{R}^{p}:

\begin{aligned}
& \underset{e_{2} \neq 0}{\text{max}}
& & \mathrm{Var}(X e_{2}) = e_{2}^{\top}Se_{2} \\
& \text{subject to} & & ||e_{2}||^{2} = 1,  \\
&                   & & e_{1}^{\top}e_{2} = 0, 
\end{aligned}

Using the formalization of the Lagrange multiplier method, we have

L = e_{2}^{\top}Se_{2} - \lambda_{2}(e_{2}^{\top}e_{2} - 1) - \phi e_{2}^{\top}e_{1}

\begin{aligned}
\frac{\partial L}{\partial e_{2}} &= 2Se_{2} - 2\lambda_{2}e_{2} - \phi e_{1} = 0 \\
 \\
\frac{\partial L}{\partial \lambda_{2}} &= e_{2}^{\top}e_{2} - 1 = 0 \Leftrightarrow  e_{2}^{\top}e_{2} = 1\\
\\
\frac{\partial L}{\partial \phi} &=  e_{2}^{\top}e_{1} = 0  \\
\end{aligned}

Now we find the values of \lambda_{2} and \phi, starting with \phi:

\begin{aligned}
2Se_{2} - 2\lambda_{2}e_{2} - \phi e_{1} &= 0 \\
2e_{1}^{\top}Se_{2} - 2\lambda_{2}e_{1}^{\top}e_{2} - \phi e_{1}^{\top}e_{1} &= 0 \\
2\lambda_{1}e_{1}^{\top}e_{2} - \phi &= 0\\
\phi &= 0
\end{aligned}

where e_{1}^{\top}e_{1} = 1, e_{1}^{\top}e_{2} = 0, \ \mbox{and} \ e_{1}^{\top}S = \lambda_{1}e_{1}^{\top}; and to \lambda_{2}

\begin{aligned}
\frac{\partial L}{\partial e_{2}} &= 2Se_{2} - 2\lambda_{2}e_{2} - \phi e_{1} = 0 \overset{\phi = 0}{\Leftrightarrow} Se_{2} = \lambda_{2}e_{2} \Rightarrow e_{2}^{\top}Se_{2} = \lambda_{2}\\
\end{aligned}

Now, \lambda_{2} is the second maximum variance for \underset{ n \times 1}{Y_{2}} = \underset{ n \times p}{(X - \overline{x})} \underset{ p \times 1}{e_{2}}.

The results above can be extrapolated to k principal components, where k \leq p. Then

\begin{aligned}
\underset{ n \times 1}{Y_{1}} = \underset{ n \times p}{(X - \overline{x})} \underset{ p \times 1}{e_{1}} &\Leftrightarrow \mathrm{Var}(Y_{1}) = \mathrm{Var}(Xe_{1}) = e_{1}^{\top}Se_{1} = \lambda_{1}\\
\underset{ n \times 1}{Y_{2}} = \underset{ n \times p}{(X - \overline{x})} \underset{ p \times 1}{e_{2}} &\Leftrightarrow  \mathrm{Var}(Y_{2}) = \mathrm{Var}(Xe_{2}) = e_{2}^{\top}Se_{2} = \lambda_{2}\\
 \vdots \\
 \underset{ n \times 1}{Y_{k}} = \underset{ n \times p}{(X - \overline{x})} \underset{ p \times 1}{e_{k}} &\Leftrightarrow  \mathrm{Var}(Y_{k}) = \mathrm{Var}(Xe_{k}) = e_{k}^{\top}Se_{k} = \lambda_{k}\\
\end{aligned}

and

\begin{aligned}
\mathrm{Cov}(Y_{i}, Y_{j}) = e_{i}^{\top}Se_{j} = \lambda_{i}e_{i}^{\top}e_{j} = 0, \ \forall \ i \neq j= 1, \dots, k \leq p \\
\end{aligned}

where \lambda_{1} \geq \lambda_{2} \geq \dots \geq \lambda_{k}, and Y_{i} \perp Y_{j}, i.e., the components are uncorrelated with each other. Another point to note is

\begin{aligned}
\sum_{t = 1}^{p} \mathrm{Var}(\textbf{x}_{t}) = \sum_{t = 1}^{p} \mathrm{Var}(Y_{t}) \Leftrightarrow  \sum_{t = 1}^{p} S_{tt} = \sum_{t = 1}^{p} \lambda_{t}\\
\end{aligned}

in which X = [\textbf{x}_{1} \ \dots \ \textbf{x}_{p}].

Important: if the data \underset{ n \times p}{X} are already centered on the mean, the covariance matrix reduces to S = X^{\top}X / (n - 1).

The proportion of the total variance retained by component Y_{j} (its "information") is

\begin{aligned}
\text{Information of}\ Y_{j} : \cfrac{\lambda_{j}}{\sum_{t = 1}^{p} \lambda_{t}}\times 100
\end{aligned}

Of course, the first component retains the most information about the variance of the original data, the second component the second most, and so on down to the p-th component, which retains the least.

Here, k is the number of components chosen by the researcher. It is common for the first k principal components to explain 80% to 90% of the total sample variance of the original data. That is why we can replace the original p variables with the k components without much loss of information.
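The whole PCA recipe above (eigendecomposition of S, components Y_{j}, variance bookkeeping) can be sketched in a few lines of NumPy; the data below are synthetic and purely illustrative:

```python
import numpy as np

# Synthetic correlated data: n = 200 observations, p = 4 variables.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4)) @ np.array([[3.0, 0.0, 0.0, 0.0],
                                          [1.0, 2.0, 0.0, 0.0],
                                          [0.0, 1.0, 1.0, 0.0],
                                          [0.0, 0.0, 0.5, 0.2]])

# Sample covariance of the centered data.
Xc = X - X.mean(axis=0)
S = Xc.T @ Xc / (X.shape[0] - 1)

# eigh returns ascending eigenvalues; reverse so lambda_1 >= ... >= lambda_p.
eigvals, eigvecs = np.linalg.eigh(S)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

# Principal components Y_j = (X - xbar) e_j, one per column.
Y = Xc @ eigvecs
info = eigvals / eigvals.sum() * 100   # "information" (% of variance) of each Y_j

# Var(Y_j) = lambda_j, and total variance is preserved: sum S_tt = sum lambda_t.
assert np.allclose(Y.var(axis=0, ddof=1), eigvals)
assert np.isclose(np.trace(S), eigvals.sum())
print(np.round(info, 1))
```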

Now, I will introduce the correlation coefficient between Y_{j} and \textbf{x}_{i} for all i, j = 1, \dots, p. Then

\begin{aligned}
{\Large r}_{\small Y_{j},\textbf{x}_{i}} = \cfrac{e_{ji}\sqrt{\lambda_{j}}}{\sqrt{S_{ii}}}
\end{aligned}

With this measure of the correlation between Y and x, you can decide which variables to keep in a specific component and which to remove. There is no magic formula for this decision. In my view, if |{\Large r}| < 0.25, you should think about how important that variable is in the component, and then decide.
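The formula r_{Y_j, x_i} = e_{ji}\sqrt{\lambda_j}/\sqrt{S_{ii}} can be verified against the empirical correlation between each component and each original variable; the sketch below uses synthetic data for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(120, 3))
Xc = X - X.mean(axis=0)
S = np.cov(X, rowvar=False)

# Eigendecomposition with eigenvalues in descending order.
eigvals, eigvecs = np.linalg.eigh(S)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]
Y = Xc @ eigvecs                       # principal components

# r[j, i] = e_{ij} sqrt(lambda_j) / sqrt(S_ii): correlation of Y_j with x_i.
r = (eigvecs * np.sqrt(eigvals)).T / np.sqrt(np.diag(S))

# It matches the empirical correlation coefficient for every (j, i) pair.
for j in range(3):
    for i in range(3):
        emp = np.corrcoef(Y[:, j], X[:, i])[0, 1]
        assert np.isclose(r[j, i], emp)
```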

In general, the principal components change under transformations of scale. For example, imagine one variable \textbf{x}_{1} \in (0, 1) and another variable \textbf{x}_{2} > 1000 (minimum value). The principal components will tend to give more importance to \textbf{x}_{2} than to \textbf{x}_{1} simply because of the difference in scale, which may not reflect reality. In this case, you need to standardize your data.

Assuming that,

U = \begin{bmatrix}
            S_{11} &        & &0\\
                   & S_{22} & &\\
                   &        & \ddots &\\
            0       &        &        &S_{pp} \\
          \end{bmatrix} \Rightarrow 
U^{-1/2} = \begin{bmatrix}
           \cfrac{1}{\sqrt{S_{11}}}  &        & &0\\
                   & \cfrac{1}{\sqrt{S_{22}}} & &\\
                   &        & \ddots &\\
            0       &        &        & \cfrac{1}{\sqrt{S_{pp}}} \\
          \end{bmatrix}

so

\begin{aligned}
Z &= \underset{ n \times p}{(X - \overline{x})} \underset{ p \times p}{U^{-1/2}}\\
\\
\mathrm{Cov}(Z) &= U^{-1/2} \mathrm{Cov}(X) U^{-1/2} = U^{-1/2} S U^{-1/2} = R\\
\\
S &= U^{-1/2} R U^{-1/2} \\
\end{aligned}

where R is the correlation matrix of X. In this way, working with the standardized data is the same as working with the correlation matrix. For standardized data, then,

\begin{aligned}
\text{Principal component:} &\ \ \ \underset{ n \times 1}{Y_{j}} = \underset{ n \times p}{Z} \underset{ p \times 1}{e_{j}} \Leftrightarrow \mathrm{Var}(Y_{j}) = \lambda_{j}\\
\\
\text{Total variance:} &\ \ \ \sum_{t = 1}^{p} \mathrm{Var}(\textbf{z}_{t}) = \sum_{t = 1}^{p} \mathrm{Var}(Y_{t}) \Leftrightarrow  \sum_{t = 1}^{p} R_{tt} = \sum_{t = 1}^{p} \lambda_{t} = p\\
\\
\text{Information of}\ Y_{j} :& \ \ \ \cfrac{\lambda_{j}}{p}\times 100 \\
\\
\text{Correlation coefficient:}&\ \ \ {\Large r}_{\small Y_{j},\textbf{x}_{i}} = e_{ji}\sqrt{\lambda_{j}}
\end{aligned}

Everything that has already been said for the covariance matrix also applies to the correlation matrix.
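The equivalence between standardizing the data and working with the correlation matrix, and the fact that the eigenvalues of R sum to p, can be checked numerically (synthetic data, illustrative only):

```python
import numpy as np

# Synthetic data with wildly different scales across the p = 3 variables.
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3)) * np.array([1.0, 50.0, 1000.0])

Xc = X - X.mean(axis=0)
S = np.cov(X, rowvar=False)

# U^{-1/2}: diagonal matrix of 1/sqrt(S_ii).
U_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(S)))

Z = Xc @ U_inv_sqrt                    # standardized data
R = U_inv_sqrt @ S @ U_inv_sqrt        # correlation matrix of X

# Cov(Z) = R, and R matches NumPy's correlation estimator.
assert np.allclose(np.cov(Z, rowvar=False), R)
assert np.allclose(np.corrcoef(X, rowvar=False), R)

# The eigenvalues of R sum to p, since R_tt = 1 for every t.
eigvals = np.linalg.eigvalsh(R)
assert np.isclose(eigvals.sum(), X.shape[1])
```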

Data application will come soon!
